dcg_score (Discounted Cumulative Gain)#

dcg_score measures ranking quality: it rewards putting highly relevant items at the top, and it discounts lower ranks logarithmically.

If you care about “are the best things at the top?” (search, recommendations, multilabel ranking), DCG is a natural tool.

Learning goals#

  • understand the DCG@k definition and the log discount

  • build intuition with plots (per-rank contributions + cumulative curves)

  • implement dcg_score in NumPy (including tie-averaging like scikit-learn)

  • see how DCG/NDCG is used to tune a simple linear ranking model

  • know pros/cons and common pitfalls

Quick import#

from sklearn.metrics import dcg_score

Table of contents#

  1. Definitions and notation

  2. Intuition (plots)

  3. NumPy implementation (+ sanity checks vs sklearn)

  4. Using DCG/NDCG to optimize/tune a simple model

  5. Pros, cons, pitfalls

import numpy as np

import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots

from sklearn.metrics import dcg_score as sk_dcg_score
from sklearn.metrics import ndcg_score as sk_ndcg_score
from sklearn.model_selection import train_test_split

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(0)

1) Definitions and notation#

Setup (one query)#

For one query (or one sample), assume we have \(m\) items (documents, products, labels, …).

  • True gain (a.k.a. relevance): \(y_j\) for item \(j\)
    (usually \(y_j \ge 0\), often an integer like 0/1/2/3)

  • Predicted score: \(s_j\) for item \(j\)

Let \(\pi\) be the permutation that sorts scores descending:

\[ s_{\pi_1} \ge s_{\pi_2} \ge \dots \ge s_{\pi_m} \]

Define a logarithmic discount at rank \(r\) (1-indexed):

\[ d_r = \frac{1}{\log_b(r+1)} \]
  • \(b\) is the log base (scikit-learn uses log_base=2 by default)

  • note that with the default \(b = 2\), \(d_1 = 1/\log_2(2) = 1\), so the top rank is not discounted; for other bases, \(d_1 = 1/\log_b(2)\)
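We can sanity-check the discount values directly (a small sketch; base 2, first five ranks):

```python
import numpy as np

# Discount d_r = 1 / log2(r + 1) for ranks 1..5 (base 2, scikit-learn's default).
ranks = np.arange(1, 6)
discount = 1.0 / np.log2(ranks + 1)
print(np.round(discount, 4))  # starts at 1.0, then decays slowly
```

The decay is slow: rank 3 still gets half the weight of rank 1.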

DCG@k#

Discounted Cumulative Gain at cutoff \(k\) is:

\[ \mathrm{DCG@}k(y, s) = \sum_{r=1}^{\min(k,m)} y_{\pi_r}\, d_r \]

Interpretation: each item’s gain is counted, but late ranks matter less.
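A tiny worked example (gains and scores made up for illustration): with gains \(y = [3, 0, 2]\) already in score order, \(\mathrm{DCG@}3 = 3 \cdot d_1 + 0 \cdot d_2 + 2 \cdot d_3 = 3 + 0 + 1 = 4\).

```python
import numpy as np
from sklearn.metrics import dcg_score

y_true = np.array([[3, 0, 2]])         # gains
y_score = np.array([[0.9, 0.5, 0.1]])  # scores already induce the order 3, 0, 2

# By hand: 3/log2(2) + 0/log2(3) + 2/log2(4) = 3 + 0 + 1 = 4
dcg = dcg_score(y_true, y_score)
print(dcg)  # 4.0
```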

Important note: what is “gain”?#

In classical Information Retrieval, people often transform relevance levels with an exponential gain, e.g.

\[ g(y) = 2^y - 1 \]

and compute DCG on \(g(y)\). scikit-learn’s dcg_score does not apply a gain transform: it assumes y_true already contains the gains you want.
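So if you want exponential gains, apply the transform yourself before calling dcg_score (a small sketch with made-up values):

```python
import numpy as np
from sklearn.metrics import dcg_score

y_true = np.array([[3, 2, 0, 1]])
y_score = np.array([[0.8, 0.6, 0.4, 0.2]])

linear = dcg_score(y_true, y_score)                 # gains used as-is
exponential = dcg_score(2.0**y_true - 1, y_score)   # g(y) = 2^y - 1, applied manually
print(linear, exponential)
```

The exponential transform makes high relevance levels dominate the score much more strongly.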

Tie handling (scikit-learn default)#

If some predicted scores are tied, scikit-learn averages over all tie permutations. For a tied group \(T\) occupying ranks \(r..r+|T|-1\):

\[ \text{group contribution} = \Big(\frac{1}{|T|}\sum_{j\in T} y_j\Big) \cdot \sum_{t=r}^{r+|T|-1} d_t \]

This equals the expected DCG if ties are broken uniformly at random. Setting ignore_ties=True skips this averaging and commits to one arbitrary tie order from the sort, so only use it if you are sure there are no ties.
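A quick sketch with made-up values: gains [3, 1, 2] with the top two scores tied, so the tied group occupying ranks 1–2 contributes its mean gain \((3+1)/2 = 2\) times \(d_1 + d_2\).

```python
import numpy as np
from sklearn.metrics import dcg_score

y_true = np.array([[3, 1, 2]])
y_score = np.array([[0.5, 0.5, 0.2]])  # first two scores are tied

d = 1.0 / np.log2(np.arange(1, 4) + 1)
expected = (3 + 1) / 2 * (d[0] + d[1]) + 2 * d[2]  # tie-averaged by hand
print(dcg_score(y_true, y_score), expected)        # both ≈ 4.2619
```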

Normalization (NDCG)#

Raw DCG depends on the scale of gains and the number of items. A common normalization is:

\[ \mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k} \]

where \(\mathrm{IDCG@}k\) is the DCG@k of the ideal ranking (items sorted by decreasing \(y\)). NDCG lies in \([0,1]\) whenever gains are non-negative, since no ranking can beat the ideal one.
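We can verify the normalization by hand (a small sketch with arbitrary toy values): dividing DCG by the ideal DCG reproduces ndcg_score.

```python
import numpy as np
from sklearn.metrics import dcg_score, ndcg_score

y_true = np.array([[3, 2, 3, 0, 1, 2]])
y_score = np.array([[0.1, 0.2, 0.3, 0.95, 0.8, 0.7]])

dcg = dcg_score(y_true, y_score)
idcg = dcg_score(y_true, y_true, ignore_ties=True)  # ideal ordering
print(dcg / idcg, ndcg_score(y_true, y_score))      # both values agree
```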

2) Intuition (plots)#

We’ll use a toy query with 6 items.

We’ll compare two rankings:

  • a good model puts high-gain items near the top

  • a bad model pushes good items down

We’ll visualize:

  • per-rank gain, discount, and contribution

  • cumulative DCG curves

  • DCG@k as a function of the cutoff \(k\)

doc_ids = np.array([f"doc_{i}" for i in range(6)])

# Gains (true relevance). In sklearn, these are used directly (no gain transform).
y_true_1q = np.array([3, 2, 3, 0, 1, 2], dtype=float)

# Two different score vectors (both induce a ranking).
y_score_good = np.array([0.90, 0.70, 0.80, 0.05, 0.30, 0.60])
y_score_bad = np.array([0.10, 0.20, 0.30, 0.95, 0.80, 0.70])

Y_true = y_true_1q[None, :]

print("sklearn DCG (good):", sk_dcg_score(Y_true, y_score_good[None, :]))
print("sklearn DCG (bad): ", sk_dcg_score(Y_true, y_score_bad[None, :]))
print("sklearn NDCG (good):", sk_ndcg_score(Y_true, y_score_good[None, :]))
print("sklearn NDCG (bad): ", sk_ndcg_score(Y_true, y_score_bad[None, :]))

for k in [1, 3, 5]:
    print(f"DCG@{k} (good): {sk_dcg_score(Y_true, y_score_good[None, :], k=k):.4f}")
    print(f"DCG@{k} (bad):  {sk_dcg_score(Y_true, y_score_bad[None, :], k=k):.4f}")
sklearn DCG (good): 7.140995184095699
sklearn DCG (bad):  4.765286603584786
sklearn NDCG (good): 0.9999999999999998
sklearn NDCG (bad):  0.6673140760825534
DCG@1 (good): 3.0000
DCG@1 (bad):  0.0000
DCG@3 (good): 5.8928
DCG@3 (bad):  1.6309
DCG@5 (good): 7.1410
DCG@5 (bad):  3.6967
def dcg_rank_contributions(y_true_1d, y_score_1d, *, k=None, log_base=2, doc_ids=None):
    """Return rank-wise DCG components for a single query.

    This ignores tie-averaging and is meant for visualization.
    """
    y_true_1d = np.asarray(y_true_1d, dtype=float)
    y_score_1d = np.asarray(y_score_1d, dtype=float)
    if y_true_1d.shape != y_score_1d.shape:
        raise ValueError("y_true_1d and y_score_1d must have the same shape")

    n_labels = y_true_1d.size
    order = np.argsort(y_score_1d)[::-1]

    gains = y_true_1d[order]
    scores = y_score_1d[order]

    ranks = np.arange(1, n_labels + 1)
    discount = 1.0 / (np.log(ranks + 1) / np.log(log_base))
    if k is not None:
        discount[k:] = 0.0

    contrib = gains * discount
    cum_dcg = np.cumsum(contrib)

    if doc_ids is None:
        doc_ids = np.arange(n_labels)
    doc_ids = np.asarray(doc_ids)

    return {
        "rank": ranks,
        "doc": doc_ids[order],
        "gain": gains,
        "score": scores,
        "discount": discount,
        "contrib": contrib,
        "cum_dcg": cum_dcg,
    }
def plot_dcg_breakdown(components, *, title="DCG breakdown"):
    ranks = components["rank"]

    fig = make_subplots(
        rows=2,
        cols=1,
        shared_xaxes=True,
        vertical_spacing=0.14,
        specs=[[{"secondary_y": True}], [{"secondary_y": True}]],
    )

    fig.add_trace(
        go.Bar(
            x=ranks,
            y=components["gain"],
            name="gain (y_true)",
            hovertemplate="rank=%{x}<br>gain=%{y}<br>doc=%{customdata[0]}<br>score=%{customdata[1]:.3f}<extra></extra>",
            customdata=np.c_[components["doc"], components["score"]],
        ),
        row=1,
        col=1,
        secondary_y=False,
    )
    fig.add_trace(
        go.Scatter(
            x=ranks,
            y=components["discount"],
            mode="lines+markers",
            name="discount",
            hovertemplate="rank=%{x}<br>discount=%{y:.3f}<extra></extra>",
        ),
        row=1,
        col=1,
        secondary_y=True,
    )

    fig.add_trace(
        go.Bar(
            x=ranks,
            y=components["contrib"],
            name="gain × discount",
            hovertemplate="rank=%{x}<br>contrib=%{y:.3f}<extra></extra>",
        ),
        row=2,
        col=1,
        secondary_y=False,
    )
    fig.add_trace(
        go.Scatter(
            x=ranks,
            y=components["cum_dcg"],
            mode="lines+markers",
            name="cumulative DCG",
            hovertemplate="rank=%{x}<br>cum_dcg=%{y:.3f}<extra></extra>",
        ),
        row=2,
        col=1,
        secondary_y=True,
    )

    fig.update_xaxes(title_text="rank (1 = top)", row=2, col=1)
    fig.update_yaxes(title_text="gain", row=1, col=1, secondary_y=False)
    fig.update_yaxes(title_text="discount", row=1, col=1, secondary_y=True)
    fig.update_yaxes(title_text="contribution", row=2, col=1, secondary_y=False)
    fig.update_yaxes(title_text="cumulative DCG", row=2, col=1, secondary_y=True)
    fig.update_layout(title=title, legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="left", x=0))

    return fig


good = dcg_rank_contributions(y_true_1q, y_score_good, k=None, doc_ids=doc_ids)
bad = dcg_rank_contributions(y_true_1q, y_score_bad, k=None, doc_ids=doc_ids)

plot_dcg_breakdown(good, title="Good ranking: per-rank DCG contributions").show()
plot_dcg_breakdown(bad, title="Bad ranking: per-rank DCG contributions").show()
def cumulative_dcg_curve(y_true_1d, y_score_1d, *, log_base=2):
    comps = dcg_rank_contributions(y_true_1d, y_score_1d, k=None, log_base=log_base)
    return comps["rank"], comps["cum_dcg"]


r, dcg_good = cumulative_dcg_curve(y_true_1q, y_score_good)
_, dcg_bad = cumulative_dcg_curve(y_true_1q, y_score_bad)
_, dcg_ideal = cumulative_dcg_curve(y_true_1q, y_true_1q)  # ideal ordering

fig = go.Figure()
fig.add_trace(go.Scatter(x=r, y=dcg_good, mode="lines+markers", name="good"))
fig.add_trace(go.Scatter(x=r, y=dcg_bad, mode="lines+markers", name="bad"))
fig.add_trace(go.Scatter(x=r, y=dcg_ideal, mode="lines+markers", name="ideal (IDCG)"))

fig.update_layout(
    title="Cumulative DCG vs rank",
    xaxis_title="rank (1 = top)",
    yaxis_title="cumulative DCG",
)
fig.show()
ks = np.arange(1, len(y_true_1q) + 1)
dcg_good_k = [sk_dcg_score(Y_true, y_score_good[None, :], k=int(k)) for k in ks]
dcg_bad_k = [sk_dcg_score(Y_true, y_score_bad[None, :], k=int(k)) for k in ks]
idcg_k = [sk_dcg_score(Y_true, Y_true, k=int(k), ignore_ties=True) for k in ks]
ndcg_good_k = [g / i if i > 0 else 0.0 for g, i in zip(dcg_good_k, idcg_k)]
ndcg_bad_k = [g / i if i > 0 else 0.0 for g, i in zip(dcg_bad_k, idcg_k)]

fig = make_subplots(rows=1, cols=2, subplot_titles=["DCG@k", "NDCG@k"], shared_xaxes=True)

fig.add_trace(go.Scatter(x=ks, y=dcg_good_k, mode="lines+markers", name="good"), row=1, col=1)
fig.add_trace(go.Scatter(x=ks, y=dcg_bad_k, mode="lines+markers", name="bad"), row=1, col=1)

fig.add_trace(go.Scatter(x=ks, y=ndcg_good_k, mode="lines+markers", name="good"), row=1, col=2)
fig.add_trace(go.Scatter(x=ks, y=ndcg_bad_k, mode="lines+markers", name="bad"), row=1, col=2)

fig.update_xaxes(title_text="k (cutoff)")
fig.update_yaxes(title_text="score")
fig.update_layout(title="Effect of cutoff k", legend=dict(orientation="h", yanchor="bottom", y=1.08, xanchor="left", x=0))
fig.show()

3) NumPy implementation (including tie averaging)#

We’ll implement a drop-in equivalent of scikit-learn’s dcg_score:

  • works on y_true and y_score shaped (n_samples, n_labels)

  • supports k, log_base, sample_weight, and ignore_ties

  • matches scikit-learn tie-averaging when ignore_ties=False

def _as_2d(a, *, name):
    a = np.asarray(a)
    if a.ndim == 1:
        a = a[None, :]
    if a.ndim != 2:
        raise ValueError(f"{name} must be 1D or 2D, got shape {a.shape}")
    return a


def _discount_vector(n_labels, *, k=None, log_base=2):
    if log_base <= 1:
        raise ValueError("log_base must be > 1")

    ranks0 = np.arange(n_labels)
    discount = 1.0 / (np.log(ranks0 + 2.0) / np.log(log_base))
    if k is not None:
        if not (1 <= int(k) <= n_labels):
            raise ValueError(f"k must be in [1, {n_labels}], got {k}")
        discount[int(k) :] = 0.0
    return discount


def _tie_averaged_dcg_1d(y_true_1d, y_score_1d, discount_cumsum):
    """DCG for one sample by averaging over all permutations of tied scores.

    Matches sklearn.metrics._ranking._tie_averaged_dcg.
    """
    _, inv, counts = np.unique(-y_score_1d, return_inverse=True, return_counts=True)

    mean_gain_per_group = np.zeros(len(counts), dtype=float)
    np.add.at(mean_gain_per_group, inv, y_true_1d)
    mean_gain_per_group /= counts

    groups_end = np.cumsum(counts) - 1

    discount_sums = np.empty(len(counts), dtype=float)
    discount_sums[0] = discount_cumsum[groups_end[0]]
    discount_sums[1:] = np.diff(discount_cumsum[groups_end])

    return float((mean_gain_per_group * discount_sums).sum())


def _dcg_sample_scores_numpy(y_true, y_score, *, k=None, log_base=2, ignore_ties=False):
    y_true = np.asarray(y_true, dtype=float)
    y_score = np.asarray(y_score, dtype=float)
    if y_true.shape != y_score.shape:
        raise ValueError(f"y_true and y_score must have the same shape, got {y_true.shape} vs {y_score.shape}")

    n_samples, n_labels = y_true.shape
    discount = _discount_vector(n_labels, k=k, log_base=log_base)

    if ignore_ties:
        ranking = np.argsort(y_score, axis=1)[:, ::-1]
        ranked_true = y_true[np.arange(n_samples)[:, None], ranking]
        return discount @ ranked_true.T

    discount_cumsum = np.cumsum(discount)
    out = np.empty(n_samples, dtype=float)
    for i in range(n_samples):
        out[i] = _tie_averaged_dcg_1d(y_true[i], y_score[i], discount_cumsum)
    return out


def dcg_score_numpy(y_true, y_score, *, k=None, log_base=2, sample_weight=None, ignore_ties=False, return_per_sample=False):
    """NumPy implementation of sklearn.metrics.dcg_score."""
    y_true = _as_2d(y_true, name="y_true")
    y_score = _as_2d(y_score, name="y_score")

    scores = _dcg_sample_scores_numpy(y_true, y_score, k=k, log_base=log_base, ignore_ties=ignore_ties)

    if return_per_sample:
        return scores

    if sample_weight is None:
        return float(np.mean(scores))
    sample_weight = np.asarray(sample_weight, dtype=float)
    if sample_weight.shape != (scores.shape[0],):
        raise ValueError(f"sample_weight must have shape ({scores.shape[0]},), got {sample_weight.shape}")
    return float(np.average(scores, weights=sample_weight))


def ndcg_score_numpy(y_true, y_score, *, k=None, sample_weight=None, ignore_ties=False, return_per_sample=False):
    """NumPy implementation of sklearn.metrics.ndcg_score (built from DCG)."""
    y_true = _as_2d(y_true, name="y_true")
    y_score = _as_2d(y_score, name="y_score")

    gain = _dcg_sample_scores_numpy(y_true, y_score, k=k, log_base=2, ignore_ties=ignore_ties)
    ideal = _dcg_sample_scores_numpy(y_true, y_true, k=k, log_base=2, ignore_ties=True)

    out = np.zeros_like(gain, dtype=float)
    mask = ideal != 0
    out[mask] = gain[mask] / ideal[mask]

    if return_per_sample:
        return out

    if sample_weight is None:
        return float(np.mean(out))
    sample_weight = np.asarray(sample_weight, dtype=float)
    if sample_weight.shape != (out.shape[0],):
        raise ValueError(f"sample_weight must have shape ({out.shape[0]},), got {sample_weight.shape}")
    return float(np.average(out, weights=sample_weight))
# Sanity checks vs scikit-learn
Yt = rng.integers(0, 4, size=(50, 12))
Ys = rng.normal(size=(50, 12))

for ignore_ties in [True, False]:
    for k in [None, 5]:
        for base in [2, 10]:
            sk = sk_dcg_score(Yt, Ys, k=k, log_base=base, ignore_ties=ignore_ties)
            np_ = dcg_score_numpy(Yt, Ys, k=k, log_base=base, ignore_ties=ignore_ties)
            print(f"ignore_ties={ignore_ties} k={k} base={base} | abs diff={abs(sk - np_):.12f}")

# A tie-heavy example (by rounding scores)
Ys_ties = np.round(Ys, 1)
sk = sk_dcg_score(Yt, Ys_ties, k=5, ignore_ties=False)
np_ = dcg_score_numpy(Yt, Ys_ties, k=5, ignore_ties=False)
print("\nTie-heavy check | sklearn:", sk, "numpy:", np_, "abs diff:", abs(sk - np_))

# The classic tied-top example from sklearn docs
true_relevance = np.asarray([[10, 0, 0, 1, 5]])
scores = np.asarray([[1, 0, 0, 0, 1]])
print("\nTies at the top (k=1)")
print("sklearn (tie-avg): ", sk_dcg_score(true_relevance, scores, k=1))
print("numpy  (tie-avg): ", dcg_score_numpy(true_relevance, scores, k=1))
print("sklearn (ignore):  ", sk_dcg_score(true_relevance, scores, k=1, ignore_ties=True))
print("numpy  (ignore):  ", dcg_score_numpy(true_relevance, scores, k=1, ignore_ties=True))
ignore_ties=True k=None base=2 | abs diff=0.000000000000
ignore_ties=True k=None base=10 | abs diff=0.000000000000
ignore_ties=True k=5 base=2 | abs diff=0.000000000000
ignore_ties=True k=5 base=10 | abs diff=0.000000000000
ignore_ties=False k=None base=2 | abs diff=0.000000000000
ignore_ties=False k=None base=10 | abs diff=0.000000000000
ignore_ties=False k=5 base=2 | abs diff=0.000000000000
ignore_ties=False k=5 base=10 | abs diff=0.000000000000

Tie-heavy check | sklearn: 4.726341544169852 numpy: 4.726341544169852 abs diff: 0.0

Ties at the top (k=1)
sklearn (tie-avg):  7.5
numpy  (tie-avg):  7.5
sklearn (ignore):   5.0
numpy  (ignore):   5.0

4) Using DCG/NDCG to optimize/tune a simple model#

DCG depends only on the sort order of the scores, so it is piecewise constant in them: the gradient is zero almost everywhere and undefined wherever the ranking changes. In practice, you often:

  1. optimize a differentiable surrogate loss (pointwise/pairwise/listwise), and evaluate with NDCG/DCG

  2. or use a black-box optimizer to directly maximize NDCG/DCG
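To see the non-smoothness concretely (a small sketch with made-up values): nudging a score leaves DCG unchanged until the nudge actually flips the ranking.

```python
import numpy as np
from sklearn.metrics import dcg_score

y_true = np.array([[3, 0, 2]])
base = np.array([[0.9, 0.5, 0.1]])

dcgs = []
for eps in [0.0, 0.1, 0.5]:
    s = base.copy()
    s[0, 2] += eps  # nudge the last item's score upward
    dcgs.append(dcg_score(y_true, s))
print(dcgs)  # flat, then a jump once the ranking flips
```

This is why gradient-based training needs a surrogate: the metric gives no local signal between ranking changes.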

Here we’ll do (2) with a simple linear scoring model:

\[ s(x) = w^\top x \]

and optimize \(w\) by coordinate ascent to maximize mean NDCG@k on a toy ranking dataset.

def make_synthetic_ranking_data(*, n_queries=400, n_docs=12, n_features=6, noise=0.7, seed=0):
    rng = np.random.default_rng(seed)
    w_true = rng.normal(size=n_features)

    X = rng.normal(size=(n_queries, n_docs, n_features))
    raw = X @ w_true + noise * rng.normal(size=(n_queries, n_docs))

    # Convert raw scores to graded relevance per query using quantiles.
    q50 = np.quantile(raw, 0.50, axis=1, keepdims=True)
    q75 = np.quantile(raw, 0.75, axis=1, keepdims=True)
    q90 = np.quantile(raw, 0.90, axis=1, keepdims=True)

    y = np.zeros_like(raw, dtype=int)
    y += raw >= q50
    y += raw >= q75
    y += raw >= q90

    return X, y.astype(float), w_true


X, y_true, w_true = make_synthetic_ranking_data()
X_train, X_val, y_train, y_val = train_test_split(X, y_true, test_size=0.30, random_state=0)

print("X_train:", X_train.shape, "y_train:", y_train.shape)
print("relevance levels in y:", np.unique(y_true))
X_train: (280, 12, 6) y_train: (280, 12)
relevance levels in y: [0. 1. 2. 3.]
def mean_ndcg_at_k(y_true, y_score, *, k=5):
    return ndcg_score_numpy(y_true, y_score, k=k, ignore_ties=False)


def coordinate_ascent_ndcg(
    X_train,
    y_train,
    X_val,
    y_val,
    *,
    k=5,
    n_passes=30,
    step=0.5,
    step_decay=0.9,
    init_scale=0.2,
    seed=0,
):
    rng = np.random.default_rng(seed)
    n_features = X_train.shape[-1]

    w = rng.normal(scale=init_scale, size=n_features)
    w0 = w.copy()

    def score(X, w):
        return X @ w

    def evaluate(X, y, w):
        return mean_ndcg_at_k(y, score(X, w), k=k)

    history = {
        "pass": [],
        "step": [],
        "train_ndcg": [],
        "val_ndcg": [],
    }

    best_val = -np.inf
    best_w = w.copy()

    for p in range(n_passes):
        improved = False
        base_train = evaluate(X_train, y_train, w)

        for j in range(n_features):
            best_local = base_train
            best_delta = 0.0

            for delta in (+step, -step):
                w_try = w.copy()
                w_try[j] += delta
                cand_train = evaluate(X_train, y_train, w_try)  # candidate train NDCG
                if cand_train > best_local + 1e-12:
                    best_local = cand_train
                    best_delta = delta

            if best_delta != 0.0:
                w[j] += best_delta
                base_train = best_local
                improved = True

        train_ndcg = evaluate(X_train, y_train, w)
        val_ndcg = evaluate(X_val, y_val, w)

        history["pass"].append(p)
        history["step"].append(step)
        history["train_ndcg"].append(train_ndcg)
        history["val_ndcg"].append(val_ndcg)

        if val_ndcg > best_val:
            best_val = val_ndcg
            best_w = w.copy()

        step *= step_decay
        if not improved:
            # No single-coordinate improvement found at this step size.
            break

    return w0, best_w, history


w0, w_best, hist = coordinate_ascent_ndcg(X_train, y_train, X_val, y_val, k=5, n_passes=40, step=0.6, step_decay=0.9, seed=1)

print("initial val NDCG@5:", mean_ndcg_at_k(y_val, X_val @ w0, k=5))
print("best    val NDCG@5:", mean_ndcg_at_k(y_val, X_val @ w_best, k=5))
print("true    val NDCG@5:", mean_ndcg_at_k(y_val, X_val @ w_true, k=5))
initial val NDCG@5: 0.33609233877721406
best    val NDCG@5: 0.8092268103302992
true    val NDCG@5: 0.8086706878351096
fig = go.Figure()
fig.add_trace(go.Scatter(x=hist["pass"], y=hist["train_ndcg"], mode="lines+markers", name="train NDCG@5"))
fig.add_trace(go.Scatter(x=hist["pass"], y=hist["val_ndcg"], mode="lines+markers", name="val NDCG@5"))

fig.update_layout(
    title="Coordinate ascent on NDCG@5 (linear scoring model)",
    xaxis_title="pass",
    yaxis_title="NDCG@5",
)
fig.show()
# Visualize one query: relevance by predicted rank (before vs after)
q = 0
yq = y_val[q]
s0 = (X_val[q] @ w0)
s1 = (X_val[q] @ w_best)

comp0 = dcg_rank_contributions(yq, s0, k=5)
comp1 = dcg_rank_contributions(yq, s1, k=5)

fig = make_subplots(rows=1, cols=2, subplot_titles=["initial ranking", "optimized ranking"], shared_yaxes=True)

fig.add_trace(go.Bar(x=comp0["rank"], y=comp0["gain"], name="gain"), row=1, col=1)
fig.add_trace(go.Bar(x=comp1["rank"], y=comp1["gain"], name="gain"), row=1, col=2)

fig.update_xaxes(title_text="rank (1 = top)")
fig.update_yaxes(title_text="gain (true relevance)")
fig.update_layout(title="One validation query: gains by predicted rank (k=5)")
fig.show()

print("Query DCG@5 (initial): ", dcg_score_numpy(yq, s0, k=5))
print("Query DCG@5 (optimized):", dcg_score_numpy(yq, s1, k=5))
print("Query NDCG@5 (initial): ", ndcg_score_numpy(yq, s0, k=5))
print("Query NDCG@5 (optimized):", ndcg_score_numpy(yq, s1, k=5))
Query DCG@5 (initial):  1.0
Query DCG@5 (optimized): 4.517782560805999
Query NDCG@5 (initial):  0.14902422012004707
Query NDCG@5 (optimized): 0.6732590227960631

5) Pros, cons, pitfalls#

Pros#

  • Ranking-first: directly measures whether relevant items are near the top.

  • Top-heavy: logarithmic discount matches many UX settings (users rarely scroll far).

  • Graded relevance: naturally supports 0/1/2/3 relevance, not just binary.

  • Cutoff-friendly: DCG@k focuses evaluation on the top-\(k\) you actually show.

Cons#

  • Not normalized: raw DCG depends on the gain scale and number of items; comparing across queries is hard (use NDCG).

  • Non-smooth objective: depends on sorting; direct gradient-based optimization is non-trivial.

  • Ignores calibration: only the ordering matters (monotonic transforms of scores keep DCG unchanged unless they create ties).

Common pitfalls#

  • Mixing queries: DCG is defined per query (a list). Averaging over unrelated items can be meaningless.

  • Ties: if your model outputs few distinct scores, tie-handling matters; ignore_ties=True can be wrong.

  • Gain definition: if you expect exponential gain (e.g., \(2^y-1\)), you must transform y_true yourself.

  • Negative gains: NDCG may fall outside \([0,1]\) if y_true contains negative values.

Where DCG/NDCG is especially useful#

  • search result ranking

  • recommendation lists

  • multilabel problems where you care about the ordering of labels

Exercises#

  1. Modify dcg_score_numpy to accept a gain_fn and compare linear vs exponential gains.

  2. Try different values of k and observe how the coordinate ascent solution changes.

  3. Create a dataset where optimizing a pointwise loss (e.g., MSE on gains) does not improve NDCG@k much. Why?

References#

  • Järvelin, K., & Kekäläinen, J. (2002). Cumulated gain-based evaluation of IR techniques.

  • scikit-learn: sklearn.metrics.dcg_score, sklearn.metrics.ndcg_score